GAMMs are not time series models (!)

This title is a bit provocative, yet strictly speaking it is correct. We will show two examples where we realise what is missing from GAMMs in order to become tools for time series analysis. In short, we need to communicate to the model something about the order of the time samples, which we have not done so far.

Consider the data below:

Let’s estimate a GAM:

m1 <- bam(y ~ s(time, k = 20), data = curves)
summary(m1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(time, k = 20)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.1795848  0.0004607   389.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df    F p-value    
## s(time) 15.52   17.6 6060  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.915   Deviance explained = 91.5%
## fREML = -16489  Scale est. = 0.0021116  n = 9950
plot_smooth(m1, view = "time", col = "slateblue4", print.summary = FALSE, rug = FALSE)

Now, imagine that the same data set was not a collection of time series. For example instead of time we could have price of houses (in millions of euros), and y could be some measure of demand. The plot would be like this:

Question: How would you change the GAM above to reflect the fact that there are no curves but it is all one data set?

AR1 residuals

gam.check(m1)

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 9 iterations.
## Gradient range [-6.31743e-06,6.479967e-06]
## (score -16488.76 & scale 0.002111591).
## Hessian positive definite, eigenvalue range [8.040793,4974.011].
## Model rank =  20 / 20 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##           k'  edf k-index p-value
## s(time) 19.0 15.5    1.02    0.89

We note a strong evidence for heteroscedasticity. Where does it come from?

Look at the variation in the large peak. When y is high we are around the peak, and that is also where the largest variation across curves occurs, hence the wider residual span for higher fitted values.

Now let’s consider one particular curve, say one whose first peak is higher than the mean. As we see from the raw curves, we expect that if the residual error for this particular curve at say time = 0.4 is positive, it will also be positive in the next time samples. Hence neighbouring residuals are correlated, which violates one of the hypotheses of the model. We would like to express this fact in the GAMM.

A way to alleviate the problem above is to induce a structure in the residuals as follows: \[ \epsilon_i = \rho \cdot \epsilon_{i-1} + \psi_i \] which means that the residual at point \(i\) is proportional to the residual at point \(i-1\), i.e. one step earlier on the time axis, plus another unknown term \(\psi_i\), the latter values distributed as \(N(0, \sigma^2)\) and independent. The GAMM will not estimate the parameter \(\rho\) for us directly, rather we meed to set it ourselves. This type of sub-model for the noise is well known in signal processing and it’s called AR1, i.e. auto regressive model of order 1 (because it only depends on one step in the past).

What we typically do is this:

  1. Estimate a GAMM without AR1 term (like above)
  2. Compute autocorrelation of residuals
  3. Re-run the GAMM setting \(\rho\) equal to the autocorrelation value at lag 1

The following estimates \(\rho\) and plots the complete autocorrelation function for the residuals:

m1.acf <- acf_resid(m1)

This function estimates the correlation of residuals with themselves in the past at different lags. We take the value as lag 1 as our estimate for \(\rho\)

rho <- m1.acf[2]
# at index 1 we have lag 0, whose value is 1 by definition
# at index 2 we have lag 1, which is what we need
rho
##         1 
## 0.8045908

Now we re-run the GAMM. We set \(\rho\) by specifying the argument rho. We also need to indicate where the start of curves in the data are, otherwise the AR1 model will be also applied between the last and the first sample of curves (argument AR.start).

m1.ar1 <- bam(y ~ s(time, k = 20), data = curves,
              rho = rho, AR.start = curves$time == 0
              )
summary(m1.ar1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(time, k = 20)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.179576   0.001376   130.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(time) 12.43  15.13 804.6  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.915   Deviance explained = 91.5%
## fREML = -21706  Scale est. = 0.0020926  n = 9950
plot_smooth(m1.ar1, view = "time", col = "slateblue4", rug = FALSE, print.summary = FALSE)
acf_resid(m1.ar1)
gam.check(m1.ar1)
## 
## Method: fREML   Optimizer: perf newton
## full convergence after 7 iterations.
## Gradient range [-1.51787e-05,1.483733e-05]
## (score -21705.86 & scale 0.002092649).
## Hessian positive definite, eigenvalue range [6.899551,4974.007].
## Model rank =  20 / 20 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##           k'  edf k-index p-value
## s(time) 19.0 12.4    1.01    0.86

The results show that:

  • The strong positive AR1 term is gone
  • but an AR pattern is still there
  • heteroscedasticity is still there too
  • the estimated smooth has a wider confidence band
  • the AR1-augmented model converged quite quickly

Curve-specific random factors

The way to explicitly represent the information about curves, i.e. which sample belongs to which curve, is to introduce a random smooth factor at the curve level:

m1.randCurves <- bam(y ~ s(time, k = 20) + 
                       + s(time, curveId, bs = "fs", m=1, k = 15)
                       , nthreads = 2
                       , data = curves)
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
## 1-d smooths of same variable.
summary(m1.randCurves)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(time, k = 20) + +s(time, curveId, bs = "fs", m = 1, k = 15)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.179574   0.004535    39.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df      F p-value    
## s(time)          17.69   18.6 671.54  <2e-16 ***
## s(time,curveId) 505.63  749.0  56.77  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.984   Deviance explained = 98.5%
## fREML = -24077  Scale est. = 0.00039985  n = 9950
plot_smooth(m1.randCurves, view = "time", col = "slateblue4", rug = FALSE, print.summary = FALSE)
op <- par(mfrow=c(2,2))
gam.check(m1.randCurves)
## 
## Method: fREML   Optimizer: perf newton
## full convergence after 7 iterations.
## Gradient range [-1.198123e-07,1.173967e-07]
## (score -24076.94 & scale 0.0003998499).
## Hessian positive definite, eigenvalue range [8.57444,4985.08].
## Model rank =  770 / 770 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                    k'   edf k-index p-value
## s(time)          19.0  17.7    1.02    0.84
## s(time,curveId) 750.0 505.6    1.02    0.87
acf_resid(m1.randCurves)
par(op)

The results show that:

  • There is no AR residual anymore
  • residuals look uncorrelated, no pattern visible
  • the estimated smooth has a wider confidence band
  • the curve level random smooth-augmented model takes more time to converge

As alternative to the mgcv library, you can try the sparseFLMM library, which incorporates curve-level random smooths by default and it performs an efficient estimation based on functional PCA. sparseFLMM is (in my opinion) generally less user-friendly than mgcv.

AR1 residuals and latent factors

This is a curve dataset with a 4-level factor F4. An AR1 residual with coefficient \(\rho =\) 0.9 is added to the 4 expected curves.

Let us first model it as: y ~ F4 + s(t, by = F4), i.e. a regular GAM with one factor and no AR1 residual.

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ F4 + s(t, by = F4, k = 10)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.499089   0.001344 371.276   <2e-16 ***
## F4A.2        0.482697   0.001901 253.909   <2e-16 ***
## F4B.1       -0.003519   0.001901  -1.851   0.0643 .  
## F4B.2        0.456722   0.001901 240.246   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df     F p-value    
## s(t):F4A.1 8.622  8.959 13909  <2e-16 ***
## s(t):F4A.2 8.819  8.990 15343  <2e-16 ***
## s(t):F4B.1 8.721  8.977 14070  <2e-16 ***
## s(t):F4B.2 8.910  8.998  7570  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.993   Deviance explained = 99.3%
## fREML = -6933.4  Scale est. = 0.0018432  n = 4080

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 8 iterations.
## Gradient range [-2.118151e-05,2.040069e-05]
## (score -6933.404 & scale 0.001843157).
## Hessian positive definite, eigenvalue range [3.696508,2036.03].
## Model rank =  40 / 40 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##              k'  edf k-index p-value
## s(t):F4A.1 9.00 8.62    1.03    0.99
## s(t):F4A.2 9.00 8.82    1.03    0.99
## s(t):F4B.1 9.00 8.72    1.03    0.99
## s(t):F4B.2 9.00 8.91    1.03    0.99

Notice that:

  • The model explains most of the variance
  • Residual plots look good
  • The AR1 pattern is clear, and it looks like an exponential decay

Let us add the AR1 term to the model, taking the estimated value \(\hat{\rho} =\) 0.878893:

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ F4 + s(t, by = F4, k = 10)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.498847   0.004664 106.953   <2e-16 ***
## F4A.2        0.483057   0.006596  73.231   <2e-16 ***
## F4B.1       -0.003197   0.006596  -0.485    0.628    
## F4B.2        0.457320   0.006596  69.328   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df    F p-value    
## s(t):F4A.1 8.618  8.968 1815  <2e-16 ***
## s(t):F4A.2 8.816  8.992 2066  <2e-16 ***
## s(t):F4B.1 8.722  8.983 1837  <2e-16 ***
## s(t):F4B.2 8.912  8.998 1624  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.993   Deviance explained = 99.3%
## fREML = -10094  Scale est. = 0.0016863  n = 4080

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 6 iterations.
## Gradient range [-3.567584e-05,3.518998e-05]
## (score -10093.55 & scale 0.001686262).
## Hessian positive definite, eigenvalue range [3.762987,2036.03].
## Model rank =  40 / 40 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##              k'  edf k-index p-value
## s(t):F4A.1 9.00 8.62    1.03    0.99
## s(t):F4A.2 9.00 8.82    1.03    0.99
## s(t):F4B.1 9.00 8.72    1.03    0.98
## s(t):F4B.2 9.00 8.91    1.03    0.98

Notice that:

  • Explained var almost the same as without AR1
  • Residual plots still look good
  • The AR1 pattern is gone

Let us now change two things:

  1. Pretend we do not know there are 4 levels, but only two levels A and B
  2. Remove the AR1 noise and introduce uncorrelated noise

Let us model this dataset as: y ~ F2 + s(t, by = F2)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ F2 + s(t, by = F2, k = 10)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.744812   0.006513 114.353   <2e-16 ***
## F2B         -0.017690   0.009211  -1.921   0.0549 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value    
## s(t):F2A 5.509  6.652 754.4  <2e-16 ***
## s(t):F2B 6.604  7.734 432.8  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.672   Deviance explained = 67.3%
## fREML = 826.23  Scale est. = 0.086543  n = 4080

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 8 iterations.
## Gradient range [-5.650425e-07,4.193194e-07]
## (score 826.2282 & scale 0.08654253).
## Hessian positive definite, eigenvalue range [2.159394,2038.006].
## Model rank =  20 / 20 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##            k'  edf k-index p-value    
## s(t):F2A 9.00 5.51     0.1  <2e-16 ***
## s(t):F2B 9.00 6.60     0.1  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that:

  • Explained var dropped significantly
  • Residual plots are awful, obvious patterns, even trajectories are visible
  • There is a very large AR pattern, which does not look like an exponential decay

Let us blindly apply the AR1 recipe, i.e. take the estimated \(\hat{\rho} =\) 0.9912984 and re-fit the model:

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ F2 + s(t, by = F2, k = 10)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.74466    0.03844  19.372   <2e-16 ***
## F2B         -0.01751    0.05436  -0.322    0.747    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df   F p-value    
## s(t):F2A 8.186  8.868 194  <2e-16 ***
## s(t):F2B 8.658  8.976 254  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.672   Deviance explained = 67.3%
## fREML = -7762.9  Scale est. = 0.06822   n = 4080

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 8 iterations.
## Gradient range [-3.599553e-09,3.34515e-09]
## (score -7762.912 & scale 0.06821957).
## Hessian positive definite, eigenvalue range [3.542336,2038.014].
## Model rank =  20 / 20 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##            k'  edf k-index p-value    
## s(t):F2A 9.00 8.19     0.1  <2e-16 ***
## s(t):F2B 9.00 8.66     0.1  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that:

  • Explained var did not improve
  • Residual plots are still awful, trajectories are still there
  • AR has changed, but has not gone away

Let us instead add a curve-specific random smooth term (and remove the AR1 term):

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ F2 + s(t, by = F2, k = 10) + s(t, curveId, by = F2, bs = "fs", 
##     m = 1)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.74481    0.03880   19.20   <2e-16 ***
## F2B         -0.01769    0.05353   -0.33    0.741    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                      edf  Ref.df       F p-value    
## s(t):F2A           8.064   8.174   72.45  <2e-16 ***
## s(t):F2B           8.590   8.643  112.88  <2e-16 ***
## s(t,curveId):F2A 349.018 359.000 1248.35  <2e-16 ***
## s(t,curveId):F2B 348.065 359.000 1237.87  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.999   Deviance explained = 99.9%
## fREML =  -8261  Scale est. = 0.00039225  n = 4080

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 17 iterations.
## Gradient range [-0.0001062954,8.748938e-05]
## (score -8260.963 & scale 0.0003922519).
## Hessian positive definite, eigenvalue range [3.40806,2063.484].
## Model rank =  1460 / 1460 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                      k'    edf k-index p-value
## s(t):F2A           9.00   8.06       1    0.59
## s(t):F2B           9.00   8.59       1    0.56
## s(t,curveId):F2A 720.00 349.02       1    0.58
## s(t,curveId):F2B 720.00 348.07       1    0.56

Notice that:

  • Explained var reaches ceiling
  • Residual plots are prefect
  • AR is minimal
  • Random smooths clearly are not random, they capture the latent factor
  • Compute time is quite high (minutes, on a toy example with 80 curves)

Take home message

  • Do not apply AR1 recipe blindly
  • AC can be a consequence of a latent factor/variable
  • Yet the computational cost of the ‘proper’ solution (curve-specific smooths) is often prohibitively high